-
Notifications
You must be signed in to change notification settings - Fork 240
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a random seed specific to datagen cases #9441
Add a random seed specific to datagen cases #9441
Conversation
Signed-off-by: Alessandro Bellina <[email protected]>
build |
I have updated the test names to have the seed in the test name, here's an example:
|
build |
If I could have everything I wanted I would change the datagen so the seed is not passed into it at all. Then if someone wants to override the seed they use an annotation to force it to a specific seed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good generally.
@@ -113,7 +113,6 @@ def is_parquet_testing_tests_forced(): | |||
_inject_oom = None | |||
|
|||
def should_inject_oom(): | |||
global _inject_oom |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the global not needed any more?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is something I owed @gerashegalov from a long time ago #7925 (comment).
Since we don't need to change the variable, it's not needed.
If you had multiple datagens with the argument approach one can set a seed for each datagen invocation. With an annotation, I'd have to check if a function invocation can be annotated, though at that point I am not sure there's much point to that (though I could be wrong) |
I am seeing a failure in CI:
So I'll take a look at reproing this locally. |
This is another manifestation of #9350. That said, that's the only test that failed in this CI run. If we did have that parameter annotation like @revans2 suggested, we could at least print in the test that we meant to override the seed... I'll check on that. But worse case I may skip this test since we have a ticket for it. |
BTW, the test failure is easy to repro via:
|
… @datagen_overrides(seed=12356)
build |
@firestarman mind taking a look at this diff 7fc82d6, I am not sure if the random with seed=0 was set on purpose. |
build |
I probably just copied it from the old code and it is ok to remove it if no failures are found. |
build |
1 similar comment
build |
Looks like there are still some more tests that need to be fixed/xfailed + issues filed before we can check this in. |
Yes, there's a list of issues that are failing in CI, some I can repro locally, some I can't. I should be able to get back to this next week. |
…into add_test_seed_for_datagen
Follow on issue: #9703 |
build |
build |
build |
1 similar comment
build |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally looks good. I just want to be sure that there is a follow on issue for every test that has a hard coded seed, or a good explanation as to why it is hard coded.
@@ -91,11 +91,12 @@ def do_join(spark): | |||
@pytest.mark.parametrize('data_gen', all_gen, ids=idfn) | |||
@pytest.mark.parametrize('enable_vectorized_conf', enable_vectorized_confs, ids=idfn) | |||
@ignore_order | |||
@datagen_overrides(seed=0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why no reason here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh whoops, let me add that.. not intended
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am re-running CI with these removed. I can't repro these locally.
|
||
#sort locally because of https://github.com/NVIDIA/spark-rapids/issues/84 | ||
# After 3.1.0 is the min spark version we can drop this | ||
@ignore_order(local=True) | ||
@datagen_overrides(seed=0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where is the reason for these?
data_gen.start(rand) | ||
data = [data_gen.gen() for index in range(0, length)] | ||
return data | ||
|
||
def gen_df(spark, data_gen, length=2048, seed=0, num_slices=None): | ||
def gen_df(spark, data_gen, length=2048, seed=None, num_slices=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want a follow on issue to remove seed for gen_df and force us to go through the annotation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
build |
1 similar comment
build |
build |
build |
Closes #9282
Please note that there are a number of overrides in the integration tests, and I wasn't planning on changing these, but I can I need to look into why the are seeding from 0 each time.
src/main/python/collection_ops_test.py: step_gen.start(random.Random(0))
src/main/python/collection_ops_test.py: gen.start(random.Random(0))
src/main/python/json_fuzz_test.py:_name_gen.start(random.Random(0))
src/main/python/cache_test.py: def op_df(spark, length=2048, seed=0):
src/main/python/generate_expr_test.py:def four_op_df(spark, gen, length=2048, seed=0):
src/main/python/expand_exec_test.py: def op_df(spark, length=2048, seed=0):
I did not add the random seed to the test name, but I easily can. The reason I was leaning against this is that the user can easily set the seed, so the seed in the name of the test wouldn't match if they have changed something internally. Every test would use the same seed, and it's overwrite-able via environment variable or via the
seed
parameter in datagen.The variable we can use to override this in manual runs is:
SPARK_RAPIDS_TEST_DATAGEN_SEED